Architecture Fundamentals — Session 1
2026-02-23
Before we build with it — let’s define it.
AI that generates new content — text, images, code, audio — by learning patterns from existing data.
It doesn’t retrieve. It doesn’t look things up. It generates.
The Key Distinction
A search engine finds content that already exists. A generative model creates content that may have never existed before.
Text
Images
Audio & Video
Structured data
This Course
We focus on text generation — the foundation of every AI agent, chatbot, and assistant you’ll build.
A conceptual tour of the most important ideas.
The original goal was simple: predict the next word.
Statistical Language Models (N-Grams)
*“The capital of Saudi Arabia is ___”*
The model looked at the words before it and counted: what word most often follows “capital of Saudi Arabia is” in our corpus?
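The counting idea fits in a few lines of Python. This is a toy sketch: the corpus and the context length are invented for illustration, while a real statistical LM counted over billions of words.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real n-gram model counts over a huge text collection.
corpus = (
    "the capital of saudi arabia is riyadh . "
    "the capital of france is paris . "
    "the capital of saudi arabia is riyadh ."
).split()

# Count which word follows each 4-word context (a 5-gram model).
n = 4
counts = defaultdict(Counter)
for i in range(len(corpus) - n):
    context = tuple(corpus[i:i + n])
    counts[context][corpus[i + n]] += 1

def predict(context_words):
    """Return the word seen most often after this exact context."""
    return counts[tuple(context_words)].most_common(1)[0][0]

print(predict("of saudi arabia is".split()))  # riyadh
```

The weakness is visible in the code: the model can only complete contexts it has literally seen before, which is exactly why the field moved on.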
The question became: can we teach a model to understand words — not just count them?
graph LR
A["🔤 Statistical LMs<br/>(N-Grams)<br/>Predict by frequency"]
style A fill:#9B8EC0,stroke:#1C355E,color:#fff
The machine learned to finish sentences — nothing more.
The next leap: instead of counting words, models learned to understand them.
Words became vectors — points in a mathematical space where meaning lives.
The Key Shift
Statistical LMs asked: “What word usually comes next?” (frequency)
Neural LMs asked: “Which word *fits the meaning* of what came before?” (understanding)
Now “king” and “queen” are close in that space. Arabic and English words for “house” are neighbors. Translation became possible.
More on this in the RAG module.
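The “close in space” idea can be shown with cosine similarity. The vectors below are hand-made toys for illustration; real embeddings have hundreds of dimensions and are learned, not written by hand.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "house": [0.1, 0.2, 0.9],
    "بيت":   [0.1, 0.3, 0.9],  # Arabic for "house"
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(vectors["king"], vectors["queen"]))   # high: related words are close
print(cosine(vectors["house"], vectors["بيت"]))    # high: cross-language neighbors
print(cosine(vectors["king"], vectors["house"]))   # low: unrelated words are far
```

This is the geometric intuition behind “Arabic and English words for ‘house’ are neighbors”: similarity is just an angle in the vector space.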
graph LR
A["🔤 Statistical LMs<br/>(N-Grams)<br/>Predict by frequency"] --> B["🧠 Neural LMs<br/>Words as vectors<br/>Predict by meaning"]
style A fill:#9B8EC0,stroke:#1C355E,color:#fff
style B fill:#00C9A7,stroke:#1C355E,color:#fff
Same task. Richer representation. The machine finally understood relationships between words.
Before 2017, models (RNNs and LSTMs) read text sequentially — word by word, left to right.
The Transformer (“Attention Is All You Need”, Google 2017) changed everything: attention lets the model process all words in parallel instead of one at a time.
Why It Mattered
This is the “engine” inside every modern model — GPT, Claude, Gemini, Llama. Without it, scale was impossible.
Once we had the Transformer, we just needed to scale it up.
| Model | Year | Parameters |
|---|---|---|
| GPT-1 | 2018 | 117M |
| GPT-2 | 2019 | 1.5B |
| GPT-3 | 2020 | 175B |
The Revelation
At 175B parameters, GPT-3 could translate, summarize, write code, and answer questions — without being explicitly trained for any of them.
graph LR
A["🔤 Statistical LMs<br/>(N-Grams)"] --> B["🧠 Neural LMs<br/>Words as Vectors"]
B --> T["⚡ Transformers<br/>(2017)<br/>Parallel Attention"]
T --> C["🚀 LLMs / GPT<br/>(2018–2020)<br/>Scale + Emergence"]
style A fill:#9B8EC0,stroke:#1C355E,color:#fff
style B fill:#00C9A7,stroke:#1C355E,color:#fff
style T fill:#FFCC33,stroke:#1C355E,color:#333
style C fill:#FF7A5C,stroke:#1C355E,color:#fff
The task never changed. The scale did — and scale changed everything.
One mechanism. Many applications.
| Prompt prefix | What the model does |
|---|---|
| `"Once upon a time"` | Continues as a story |
| `"Translate to Arabic: Hello → "` | Completes the translation |
| `"# Python function that reverses a string\ndef reverse("` | Writes the code body |
The Key Insight
Translation, summarization, and coding are all completion problems in disguise. The model doesn’t switch modes — it just completes the pattern.
graph LR
A["🔤 Statistical LMs<br/>(N-Grams)"] --> B["🧠 Neural LMs<br/>Words as Vectors"]
B --> T["⚡ Transformers<br/>2017"]
T --> C["🚀 LLMs<br/>2020"]
C --> D["💬 ChatGPT<br/>(2022)<br/>Instruction-following<br/>(coming up next)"]
style A fill:#9B8EC0,stroke:#1C355E,color:#fff
style B fill:#00C9A7,stroke:#1C355E,color:#fff
style T fill:#FFCC33,stroke:#1C355E,color:#333
style C fill:#FF7A5C,stroke:#1C355E,color:#fff
style D fill:#1C355E,stroke:#00C9A7,color:#fff,stroke-dasharray: 5 5
(Dashed = foreshadowed — we’ll complete this story in Section C)
How SFT turned completers into assistants
Base GPT-2: “Write me a poem”
Write me a poem about the weather
in a different language. Write me a
poem about the weather in a...
It continues the sentence. It doesn’t understand you issued a command.
ChatGPT: “Write me a poem”
Here's a poem for you:
The morning light breaks soft and low,
Across the hills where rivers flow...
It treats your text as an instruction and responds accordingly.
What changed?
Not the underlying completion mechanism — the training data and process.
Human annotators wrote thousands of correct instruction-following responses. The model was then trained to mimic them.
[Prompt] → "Summarize this article in 3 bullet points."
[Response] → "• Point 1\n• Point 2\n• Point 3"
[Prompt] → "Translate 'Good morning' to French."
[Response] → "Bonjour"
The model learned a new pattern: prompts are commands, not text to continue.
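In code, SFT data preparation is just formatting pairs into training text. The `<|user|>` / `<|assistant|>` tags below are illustrative placeholders; every model family defines its own chat template.

```python
# Prompt/response pairs written by human annotators (examples from above).
pairs = [
    ("Summarize this article in 3 bullet points.",
     "• Point 1\n• Point 2\n• Point 3"),
    ("Translate 'Good morning' to French.",
     "Bonjour"),
]

def to_training_text(prompt, response):
    """Wrap one pair in a chat template (tags here are hypothetical)."""
    return f"<|user|>{prompt}<|assistant|>{response}<|end|>"

dataset = [to_training_text(p, r) for p, r in pairs]
print(dataset[1])  # <|user|>Translate 'Good morning' to French.<|assistant|>Bonjour<|end|>
```

The model is then trained to complete text in exactly this shape, which is why “prompts are commands” becomes the pattern it learns.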
Still completion under the hood
SFT didn’t change the architecture. It changed what the model learned to complete — now it completes helpful assistant responses.
After SFT, a second step, RLHF (Reinforcement Learning from Human Feedback), aligned the model with human preferences:
One-sentence intuition
RLHF is how we teach the model to prefer helpful and honest completions over merely plausible ones.
graph LR
A["Base Model<br/>(raw completion)"] --> B["Instruction-Tuned<br/>(follows commands)"]
B --> C["Chat-Optimized<br/>(multi-turn dialogue)"]
style A fill:#9B8EC0,stroke:#1C355E,color:#fff
style B fill:#00C9A7,stroke:#1C355E,color:#fff
style C fill:#FF7A5C,stroke:#1C355E,color:#fff
Each step is an evolutionary refinement — same foundation, more alignment.
graph LR
A["🔤 Statistical LMs<br/>(N-Grams)"] --> B["🧠 Neural LMs<br/>Words as Vectors"]
B --> T["⚡ Transformers<br/>2017"]
T --> C["🚀 LLMs<br/>2020"]
C --> D["💬 ChatGPT<br/>(2022)<br/>SFT + RLHF<br/>Instruction-following"]
style A fill:#9B8EC0,stroke:#1C355E,color:#fff
style B fill:#00C9A7,stroke:#1C355E,color:#fff
style T fill:#FFCC33,stroke:#1C355E,color:#333
style C fill:#FF7A5C,stroke:#1C355E,color:#fff
style D fill:#1C355E,stroke:#00C9A7,color:#fff
The dashed node is now complete. We have a full completer → assistant pipeline.
No retraining required — just better prompts
You don’t need to retrain the model. Just show it examples in the prompt.
GPT-3 (2020) demonstrated that a large enough model could adapt to a new task from a handful of examples — with no gradient updates, no fine-tuning, and no training cost.
This was called In-Context Learning (ICL), and it unlocked prompt engineering as a discipline.
Zero-Shot
Classify the sentiment:
"The service was terrible."
Sentiment:
Model output: Negative
Works for simple, well-defined tasks.
Few-Shot
"Great product!" → Positive
"Terrible service." → Negative
"It was okay." → Neutral
"Absolutely loved it!" →
Model output: Positive
Works for nuanced or custom formats.
Rule of thumb
When zero-shot gives inconsistent results, add 2–3 examples. That’s usually enough.
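Assembling a few-shot prompt is plain string formatting, which is worth seeing once. This sketch reuses the sentiment examples above; nothing about the model itself changes.

```python
# Labeled examples shown to the model in the prompt, not used for training.
examples = [
    ("Great product!", "Positive"),
    ("Terrible service.", "Negative"),
    ("It was okay.", "Neutral"),
]

def few_shot_prompt(query):
    """Build a few-shot prompt: examples first, then the unfinished query."""
    lines = [f'"{text}" → {label}' for text, label in examples]
    lines.append(f'"{query}" →')
    return "\n".join(lines)

print(few_shot_prompt("Absolutely loved it!"))
```

The model completes the final `→`, imitating the pattern of the examples, which is in-context learning in its simplest form.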
The system prompt is the personality dial — it shapes every response without the user seeing it.
Change the system prompt → change the model’s persona, tone, and constraints. No retraining needed.
Four levers you control
Prompt engineering is not a workaround — it’s the primary interface between your application and the model.
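In practice, the system prompt is one entry in the role-based message list that most chat APIs accept. The format below is the widely used convention (roles `system` / `user` / `assistant`); the prompts themselves are invented for illustration, and no specific SDK is assumed.

```python
def build_messages(system_prompt, user_input):
    """Role-based chat messages: the system entry is invisible to the user."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]

# Same user input, two personas — only the system prompt changes.
formal = build_messages(
    "You are a formal legal assistant. Cite relevant articles.",
    "Can I break my lease early?",
)
casual = build_messages(
    "You are a friendly helper. Keep answers short and simple.",
    "Can I break my lease early?",
)
print(formal[0]["content"])
```

Swapping the first entry swaps the persona, with no retraining, which is exactly the “personality dial” described above.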
From chain of thought to o1 and beyond
In 2022, researchers discovered that a single added phrase nearly doubled accuracy on math benchmarks:
Direct answer
Q: Roger has 5 tennis balls.
He buys 2 more cans of 3 balls each.
How many does he have?
A: 8
Wrong. (The model guessed.)
With Chain of Thought
Q: Roger has 5 tennis balls.
He buys 2 more cans of 3 balls each.
How many does he have?
Let's think step by step.
A: He starts with 5.
2 cans × 3 balls = 6 more.
5 + 6 = 11.
Correct — and verifiable.
Forcing the model to generate intermediate steps makes its reasoning visible: mistakes can be caught mid-derivation, and the final answer can be verified.
Intuition
The model doesn’t think before it writes — it thinks by writing. More tokens = more thinking.
graph LR
A["Chain of Thought<br/>(2022, prompt trick)"] --> B["o1<br/>(2024, trained CoT)"]
B --> C["DeepSeek-R1<br/>(2025, open weights)"]
C --> D["Claude Extended<br/>Thinking (2025)"]
style A fill:#9B8EC0,stroke:#1C355E,color:#fff
style B fill:#00C9A7,stroke:#1C355E,color:#fff
style C fill:#FF7A5C,stroke:#1C355E,color:#fff
style D fill:#1C355E,stroke:#1C355E,color:#fff
The shift: from prompting a model to think step-by-step, to training a model that reasons internally before responding.
| | Standard Model | Thinking Model |
|---|---|---|
| Response time | ~1 second | ~15–30 seconds |
| Cost | Low | 3–10× higher |
| Best for | Chat, summarization, extraction | Math, coding, complex reasoning |
| Failure mode | Hallucination | Overthinking simple tasks |
When to use thinking models
Use them when correctness matters more than speed — audit tasks, code generation, structured data extraction. Don’t use them for real-time chat.
graph LR
A["🔤 Statistical LMs<br/>(N-Grams)"] --> B["🧠 Neural LMs<br/>Words as Vectors"]
B --> T["⚡ Transformers<br/>(2017)"]
T --> C["🚀 LLMs / GPT<br/>(2020)"]
C --> D["💬 ChatGPT<br/>(2022)"]
D --> E["🤔 Thinking Models<br/>(2024–2025)"]
style A fill:#9B8EC0,stroke:#1C355E,color:#fff
style B fill:#00C9A7,stroke:#1C355E,color:#fff
style T fill:#FFCC33,stroke:#1C355E,color:#333
style C fill:#FF7A5C,stroke:#1C355E,color:#fff
style D fill:#1C355E,stroke:#00C9A7,color:#fff
style E fill:#9B4DCA,stroke:#1C355E,color:#fff
From counting words to reasoning step-by-step — one continuous thread.
The mechanics behind what you’ve been using all session
The machine doesn’t actually see “words.” It sees Tokens.
Rule of Thumb
1,000 tokens ≈ 750 words. When you’re billed for “input and output,” you’re paying for these atomic fragments.
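The rule of thumb turns directly into a back-of-the-envelope cost estimator. The prices below are placeholders, not any provider's real pricing, and the word-based token count is only an approximation of a real tokenizer.

```python
# Hypothetical prices — always check your provider's current rate card.
PRICE_PER_1K_INPUT = 0.003   # USD per 1,000 input tokens (placeholder)
PRICE_PER_1K_OUTPUT = 0.015  # USD per 1,000 output tokens (placeholder)

def estimate_tokens(text):
    """Approximate token count from the 1,000 tokens ≈ 750 words rule."""
    return round(len(text.split()) / 0.75)

def estimate_cost(prompt, expected_output_words):
    """Rough per-request cost: input tokens + expected output tokens."""
    tokens_in = estimate_tokens(prompt)
    tokens_out = round(expected_output_words / 0.75)
    return (tokens_in / 1000) * PRICE_PER_1K_INPUT \
         + (tokens_out / 1000) * PRICE_PER_1K_OUTPUT

print(estimate_tokens("the quick brown fox jumps"))  # 7
```

Lab 1 replaces this approximation with a real tokenizer, but the billing logic stays the same: you pay per atomic fragment, in and out.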
The model doesn’t “know” facts; it calculates probabilities.
“According to Article 214 of the…”
The Decision
It doesn’t always pick the top token. We use Sampling Parameters to control how it chooses from this list.
Temperature adjusts the shape of that probability curve.
Low Temp (< 0.5) - Sharpened probabilities
High Temp (> 0.8) - Flattened probabilities
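Sharpening and flattening are just the softmax with a divisor. The logits below are made-up scores for four candidate tokens, chosen only to make the effect visible.

```python
import math

# Invented raw scores (logits) for four candidate next tokens.
logits = {"Riyadh": 4.0, "the": 2.0, "a": 1.0, "Unicorn": 0.1}

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then normalize into probabilities."""
    scaled = {tok: v / temperature for tok, v in logits.items()}
    m = max(scaled.values())  # subtract max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

low = softmax_with_temperature(logits, 0.3)   # sharpened: top token dominates
high = softmax_with_temperature(logits, 1.5)  # flattened: closer to uniform
print(low["Riyadh"], high["Riyadh"])
```

Note that temperature never changes the ranking of tokens, only how much probability mass the leaders keep, which is why Temp = 0 collapses to always picking the top token.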
Top-K simply chops off the list after \(K\) items.
“According to Article 214…”
Impact
Here, setting \(K=2\) forces it to stick to the very most likely words, preventing weird deviations like “Unicorn”.
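Top-K is a hard cutoff, as the sketch below shows. The candidate probabilities are invented to mirror the “Article 214” example, including the unwanted outlier.

```python
# Invented candidate probabilities for “According to Article 214…”
probs = {"Law": 0.45, "Regulation": 0.30, "Code": 0.15,
         "Statute": 0.08, "Unicorn": 0.02}

def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then renormalize to sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:k])
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

print(top_k_filter(probs, 2))  # only "Law" and "Regulation" survive
```

The limitation is also visible: K is fixed, so the same cutoff applies whether the model is certain or torn between many good options — which is the problem Top-P solves next.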
Top-P sums up probabilities until it hits \(P\) (e.g., 0.9).
Scenario A: The Law (Ambiguous)
Scenario B: The Capital (Certain)
“The capital of Saudi Arabia is…”
Why Top-P wins
It adapts to the context. It’s strict when the answer is clear (Riyadh), and flexible when it’s open-ended or we have synonyms.
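The adaptivity is easy to demonstrate. Both candidate distributions below are invented to mirror the two scenarios: one certain, one ambiguous.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability ≥ p."""
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {tok: v / total for tok, v in kept.items()}

# Scenario B: one token dominates → Top-P keeps almost nothing else.
capital = {"Riyadh": 0.95, "Jeddah": 0.03, "Dubai": 0.02}
# Scenario A: probability is spread → Top-P keeps several options.
law = {"Law": 0.40, "Regulation": 0.30, "Code": 0.22, "Statute": 0.08}

print(len(top_p_filter(capital, 0.9)))  # 1
print(len(top_p_filter(law, 0.9)))      # 3
```

Same `p`, different vocabulary size per step: strict for “Riyadh”, flexible for the legal synonyms, with no manual tuning.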
| Parameter | Effect | Recommended Value |
|---|---|---|
| Temperature | Randomness | 0.0 (Fact/Code) – 0.7 (Chat) – 1.0 (Creative) |
| Top-P | Dynamic vocabulary | 0.9 (Standard), 1.0 (Disable) |
| Top-K | Hard vocabulary limit | 40-100 (Standard) |
Best Practice
Usually, you tune Temperature and Top-P. Leave Top-K alone or at default.
Hallucinations are structural. They aren’t a bug; they are the result of the model doing its job without enough context.
Production Risk
A hallucinating chatbot creates real legal and financial liability. “Confident, fluent lies” are harder to spot than obvious errors.
| Strategy | How It Works | When to Use |
|---|---|---|
| RAG | Ground responses in retrieved documents | Most enterprise apps (Module 03) |
| Output Validation | Verify against structured data/rules | Factual claims, numbers |
| Temp = 0 | Force the most likely (factual) path | Extraction, classification |
| Human-in-the-Loop | Human review before action | Critical decisions |
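As one concrete instance of the Output Validation row, the sketch below cross-checks every number the model cites against the source document it was given. The documents and answers are invented; a production validator would cover units, dates, and entities too.

```python
import re

def numbers_in(text):
    """Extract every integer or decimal that appears in the text."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))

def validate_numbers(model_answer, source_document):
    """Return the set of numbers the model cited that the source lacks."""
    return numbers_in(model_answer) - numbers_in(source_document)

source = "Revenue grew 12% to 4.5 billion SAR in 2024."
good = "Revenue rose 12% to 4.5 billion SAR."
bad = "Revenue rose 15% to 4.5 billion SAR."

print(validate_numbers(good, source))  # set() — nothing to flag
print(validate_numbers(bad, source))   # {'15'} — hallucinated figure
```

A non-empty result triggers a retry, a fallback, or human review; the point is that the check is mechanical and cheap compared to the liability it prevents.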
The Core Lesson
The same mechanism that writes a creative novel also writes a confident lie. Control the mechanism, or provide the data.
What you know and what’s next
This session focused on intuition. We deliberately left out:
| Skipped | Why | When you’ll need it |
|---|---|---|
| Transformer architecture | You don’t need it to build | Academic curiosity only |
| Attention math | Same reason | Deep ML research |
| KV-cache mechanics | Not relevant yet | Session 2 — it affects your API bill |
The principle
Understand the behavior of the model before its internals. Once you’re building with the API, the internals become relevant — and we’ll cover them exactly when they matter.
Up Next
Lab 1: Hands-on tokenization and cost analysis — see how tokens, context windows, and costs connect in practice.